00. Create/Clone a github repository

01. Create RStudio Project

02. Setup RStudio

Create three folders in RStudio Project: - 00 Doc: where .Rmd and graphics folder live
- 01 Data: where data, csv’s will be stored
- 02 Tableau Visualizations: Where visualizations are saved

03. Download .CSV Files

Data taken from http://popstats.unhcr.org/

04. ETL & Create Table on Database

CREATE TABLE Refugee_Stats ( Asylum_Country varchar2(4000), Origin_Country varchar2(4000), Record_Year number(38,4), Refugees number(38,4), Asylum_seekers number(38,4), Returned_Refugees number(38,4), IDPs number(38,4), Returned_IDPs number(38,4), Stateless_persons number(38,4), Others_of_concern number(38,4), Total_Population number(38,4) );

CREATE TABLE Yearly_Pop_By_Country ( – Change table_name to the table name you want. Country_Name varchar2(4000), Country_Code varchar2(4000), Indicator_Name varchar2(4000), Indicator_Code varchar2(4000), X1960 number(38,4), X1961 number(38,4), X1962 number(38,4), X1963 number(38,4), X1964 number(38,4), X1965 number(38,4), X1966 number(38,4), X1967 number(38,4), X1968 number(38,4), X1969 number(38,4), X1970 number(38,4), X1971 number(38,4), X1972 number(38,4), X1973 number(38,4), X1974 number(38,4), X1975 number(38,4), X1976 number(38,4), X1977 number(38,4), X1978 number(38,4), X1979 number(38,4), X1980 number(38,4), X1981 number(38,4), X1982 number(38,4), X1983 number(38,4), X1984 number(38,4), X1985 number(38,4), X1986 number(38,4), X1987 number(38,4), X1988 number(38,4), X1989 number(38,4), X1990 number(38,4), X1991 number(38,4), X1992 number(38,4), X1993 number(38,4), X1994 number(38,4), X1995 number(38,4), X1996 number(38,4), X1997 number(38,4), X1998 number(38,4), X1999 number(38,4), X2000 number(38,4), X2001 number(38,4), X2002 number(38,4), X2003 number(38,4), X2004 number(38,4), X2005 number(38,4), X2006 number(38,4), X2007 number(38,4), X2008 number(38,4), X2009 number(38,4), X2010 number(38,4), X2011 number(38,4), X2012 number(38,4), X2013 number(38,4), X2014 number(38,4) );

05. Visualizations

Visualization: A Look at Migration Out of Vietnam

Description:

This visualization looks at the number of refugees coming out of Vietnam over the years 1964-2008 (filter: Origin Country by Vietnam). The other filter is to ignore any value that is null. The result is a bar chart plotting aggregated number of refugees versus year, colored by asylum country, with a reference line as the average over all years.

Comments:

  • We can see clearly that there was an abrupt and permanent increase in the number of refugees in 1979. Though showing a lag in time, this corresponds to the end of the Vietnam War, resulting in the 1975 Union of the Communist North & Capitalist South of Vietnam, with the North holding political power. Hence, the data presents the large number of refugees dissenting the new political regime that sought for asylum in other countries.
  • China (colored pink in the bar chart), Vietnam’s “big brother”, has welcomed the largest portion of Vietnamese refugees.
  • The United States (colored blue in the bar chart), also has also taken in quite a number of Vietnamese refugees since 1988. This is interesting considering its involvement in the Vietnam War. (This is 13 years after the war.)

Reproduced with SQL and R

require("jsonlite")
require("RCurl")
require(ggplot2)
require(dplyr)
require(tidyr)

df <- data.frame(fromJSON(getURL(URLencode(gsub("\n", " ", 'skipper.cs.utexas.edu:5001/rest/native/?query=
"""select ASYLUM_COUNTRY, RECORD_YEAR, REFUGEES, ORIGIN_COUNTRY from Refugee_Stats;"""')), httpheader=c(DB='jdbc:oracle:thin:@sayonara.microlab.cs.utexas.edu:1521:orcl', USER='C##cs329e_cjs2599', PASS='orcl_cjs2599', MODE='native_mode', MODEL='model', returnDimensions = 'False', returnFor = 'JSON'), verbose = TRUE))); 

df <- df %>% filter (as.character(ORIGIN_COUNTRY) == "Viet Nam", as.numeric(as.character(REFUGEES)) != "null") %>% group_by(RECORD_YEAR) %>% arrange(desc(as.character(ASYLUM_COUNTRY)))

df2 <- df %>% group_by(RECORD_YEAR) %>% summarise(sum_refugees = sum(as.numeric(as.character(REFUGEES))))

p1 <- ggplot(df, aes(x=as.character(RECORD_YEAR), y = as.numeric(as.character(REFUGEES)), fill = as.character(ASYLUM_COUNTRY))) + 
  geom_bar(stat = "identity") + 
  labs(title='A Look at Migration Out of Vietnam') + 
  labs(x="Year", y="Refugees")

p1 <- p1  +
  geom_text(data = df2, 
            aes(y = sum_refugees, label = sum_refugees, fill = NULL), size = 4,
            vjust = -0.5) +
  geom_hline(data = df2, aes(yintercept = mean(as.numeric(as.character(sum_refugees))))) + 
  annotate("text", x = 1.5, y = 310000, label = 301027, size = 4)

p1

Refugees and Country Size

Description:

Here I attempted a broad survey of asylum-seeking patterns from the perspective of host countries. To accomplish this, I created a calculated field that divided the number of refugees entering a given country in a given year by that country’s total population as of 2014.

Here is the same, created with R and SQL joins:

require("jsonlite")
require("RCurl")
require(ggplot2)
require(dplyr)

df <- data.frame(fromJSON(getURL(URLencode(gsub("\n", " ", 'skipper.cs.utexas.edu:5001/rest/native/?query="""SELECT asylum_country, record_year, (all_refugees / current_pop * 100) normalized_refugees FROM (SELECT asylum_country, record_year, SUM(total_population) all_refugees,
x2014 current_pop FROM (select * from REFUGEE_STATS LEFT JOIN (select cname, continent_code from COUNTRIES) co ON co.cname = REFUGEE_STATS.asylum_country WHERE (REFUGEE_STATS.asylum_country != \\\'Various/Unknown\\\')) refs INNER JOIN (select * from YEARLY_POP_BY_COUNTRY_NUMERIC) pops ON pops.country_name = refs.asylum_country GROUP BY asylum_country, record_year, x2014);"""')), httpheader=c(DB='jdbc:oracle:thin:@sayonara.microlab.cs.utexas.edu:1521:orcl', USER='C##cs329e_cjs2599', PASS='orcl_cjs2599', MODE='native_mode', MODEL='model', returnDimensions = 'False', returnFor = 'JSON'), verbose = TRUE)));

df %>% filter(NORMALIZED_REFUGEES > 3) -> df

ggplot() + 
  coord_cartesian() + 
  scale_x_continuous() +
  scale_y_continuous() +
  labs(title='Refugee Influxes Over Time, Normalized for Host Populations') +
  labs(x=paste("Year"), y=paste("Yearly Refugee Influx as % of Host Population")) +
  layer(data=df, 
        mapping=aes(x=RECORD_YEAR, y=NORMALIZED_REFUGEES, color=ASYLUM_COUNTRY), 
        stat="identity", 
        stat_params=list(), 
        geom="line",
        geom_params=list(), 
        position=position_identity(),
  )

Comments:

  • We find a notable surge in more refugees entering smaller countries in the late 80’s and early 90’s, which historically aligns with both the dissolution of the Soviet Union (and the subsequent conflicts in the Balkans) and the start of the Gulf War. Many factors doubtless contribute to refugee movements in both regions, but at the point of surge it is indeed often Middle Eastern and Balkan countries topping the chart.

  • Some of the most interesting data in this visualization is contained in the mouseover labels. I’ve captured one in the above image as an example, but exploring the graph directly will reveal a surprising trend: almost all of the highest outliers in refugees-as-population-percentage are the result of intra-national displacement: that is, in 1993 Bosnia & Herzegovina found more than 1/3 of its own citizens as refugees in its own country. The same is true for all the highest points on Latvia’s timeline, Cyprus’s, Libera, Timor-Leste’s, Sierra Leone’s, Lebanon’s, and others’. This is consistent with the numerous civil wars erupting in Africa and the Balkans throughout the 1990’s.

  • Finally, this chart shows a global trend of woefully uneven distribution of refugees and asylum seekers. Obviously the size of a country is not the only determinant in its ability to provide refuge to a particular group, but you’ll find no G8 country has ever opened its borders to anywhere near 5% of its population in refugees. The best Western Europe and North America has ever done is Austria in 1951 (the first year of data recorded) at 3.3%

Origins and Asylum

Tableau

-In this visualization, I was interested in exploring if there are any trends in where refugees went based on their origin country. I decided to do a case study of two international powers with advanced economies and trustworthy government records: the United States of America and Germany.

-I was only interested in the number of fefugees and the number of people seeking asylum, and I needed these numbers to be seperate for the calculations. I did two seperate SQL statements which mutated the pertinent columns for USA and GER data, and then performed an inner join on this data along the ‘Origin_Country’ column.

-I then created two calculated fields that found the ratio of refugees to people who sought asylum, one for each country. The ratio of these ratios is the KPI I used, and it tells us if refugees were more likely to seek asylum in Germany or the United States. I used this information to create the cross tab displayed below.

-There are some interesting things we can glean from this information. It appears that the relations between the country of origin and the country of asylum may be important factors in determining if refuges seek official, permanent residence (asylum). A far greater ratio of people from Iran and Iraq saught asylum in Germany than they did in the USA, and the exact oposite it due for those originating in Israel (completely unsuprising statistics).

-A second interesting trend is that people originating on the American continent are equalily likely, if not more likely, to apply for asylum in the USA than they are to simply be refugees (Key examples: Argentina, Brazil, Mexico, Canada, etc etc). However, this trend is NOT consistent in Germany, where there seems to be a random dispersion of Asylum Applicants to refugees from Europe. Countries like England, Spain, and Belgium have almost no records of Asylum seekers. This could be due to the very open borders between those countries, and a relatively small number of catastrophies and conflicts in the region. In almost all of the European cases, the United States had both a higher number of Asylum Seekers, and a higher ratio of Asulum Seekers to Refugees. I’m not sure what to attribpute this to. One could say that those willing to travel to America recognize the difficulty of moving back to their homeland, and therefore apply for perminent residence instead, whereas it is relatively easy to move around in Europe, so they could very easily return home after whatever drove them out ended. However, this is just a theory, and we would need a more through analysis to delve to the bottom of this question.

Work flow in R/SQL

require("jsonlite")
require("RCurl")
require(ggplot2)
require(dplyr)

KPI_Low_Max_value = 0.80     
KPI_Medium_Max_value = 1.20

#getting data for Germany, filter by 
data_GER <- data.frame(fromJSON(getURL(URLencode(gsub("\n", " ", 'skipper.cs.utexas.edu:5001/rest/native/?query="""select Asylum_Country, Origin_Country, sum(Total_Population) as Tot_Ger, sum(Asylum_Seekers) as Asy_Ger, sum(Refugees) as Ref_Ger from REFUGEE_STATS where Asylum_Country = \\\'Germany\\\' group by Asylum_Country, Origin_Country; """')), httpheader=c(DB='jdbc:oracle:thin:@sayonara.microlab.cs.utexas.edu:1521:orcl', USER='C##cs329e_cjs2599', PASS='orcl_cjs2599', MODE='native_mode', MODEL='model', returnDimensions = 'False', returnFor = 'JSON', p1=KPI_Low_Max_value, p2=KPI_Medium_Max_value), verbose = TRUE)));

#getting Data for USA
data_USA <- data.frame(fromJSON(getURL(URLencode(gsub("\n", " ", 'skipper.cs.utexas.edu:5001/rest/native/?query="""select Asylum_Country, Origin_Country, sum(Total_Population) as Tot_USA, sum(Asylum_Seekers) as Asy_USA, sum(Refugees) as Ref_USA from REFUGEE_STATS where Asylum_Country = \\\'United States of America\\\'group by Asylum_Country, Origin_Country;
"""')), httpheader=c(DB='jdbc:oracle:thin:@sayonara.microlab.cs.utexas.edu:1521:orcl', USER='C##cs329e_cjs2599', PASS='orcl_cjs2599', MODE='native_mode', MODEL='model', returnDimensions = 'False', returnFor = 'JSON', p1=KPI_Low_Max_value, p2=KPI_Medium_Max_value), verbose = TRUE))); 

#join the two tables
data_all <- dplyr::inner_join(data_GER, data_USA, by = "ORIGIN_COUNTRY")

data_final <- data_all %>% select (ORIGIN_COUNTRY, ASY_GER, REF_GER, ASY_USA, REF_USA) %>% filter (as.numeric(as.character(ASY_GER)) > -1, as.numeric(as.character(ASY_USA)) > -1, as.numeric(as.character(REF_GER)) > -1, as.numeric(as.character(REF_USA)) > -1) %>% mutate("ref_per_asy_GER" = as.numeric(as.character(REF_GER))/as.numeric(as.character(ASY_GER)), "ref_per_asy_USA" = as.numeric(as.character(REF_USA)) / as.numeric(as.character(ASY_USA)) ) %>% arrange(ORIGIN_COUNTRY) %>% filter("ref_per_asy_GER" > -1, "ref_per_asy_USA" > -1) 

#mutate in KPI
data_final <- data_final %>% mutate (KPI = ifelse((ref_per_asy_GER/ref_per_asy_USA) < KPI_Low_Max_value, "GER", ifelse ((ref_per_asy_GER/ref_per_asy_USA) < KPI_Medium_Max_value, "Neutral", ifelse((ref_per_asy_GER/ref_per_asy_USA) > KPI_Medium_Max_value, "USA", "N/A"))))

spread(df, COLOR, SUM_PRICE) %>% View

ggplot() + 
  coord_cartesian() + 
  scale_x_discrete() +
  scale_y_discrete() +
  labs(title='Asylum Preference of US vs Germany by Origin') +
  labs(x=paste("Values"), y=paste("ORIGIN_COUNTRY")) +
  theme(axis.text.y=element_text(size = 7))+ 
  theme(axis.text.x=element_text(angle=50, size=10, vjust=0.5))+
  layer(data=data_final, 
        mapping=aes(x="ASY_GER", y=as.character(ORIGIN_COUNTRY), label=as.character(ASY_GER)), 
        stat="identity", 
        stat_params=list(), 
        geom="text",
        geom_params=list(colour="black", size=2), 
        position=position_identity()
  ) +
  layer(data=data_final, 
        mapping=aes(x="REF_GER", y=as.character(ORIGIN_COUNTRY), label=as.character(REF_GER)), 
        stat="identity", 
        stat_params=list(), 
        geom="text",
        geom_params=list(colour="black", size = 2), 
        position=position_identity()
  ) +
  layer(data=data_final, 
        mapping=aes(x="ASY_USA", y=as.character(ORIGIN_COUNTRY), label=as.character(ASY_USA)), 
        stat="identity", 
        stat_params=list(), 
        geom="text",
        geom_params=list(colour="black", size = 2), 
        position=position_identity()
  ) +
  layer(data=data_final, 
        mapping=aes(x="REF_USA", y=as.character(ORIGIN_COUNTRY), label=as.character(REF_USA)), 
        stat="identity", 
        stat_params=list(), 
        geom="text",
        geom_params=list(colour="black", size = 2), 
        position=position_identity()
  ) +
  layer(data=data_final, 
        mapping=aes(x="ASY_USA", y=as.character(ORIGIN_COUNTRY), label=as.character(ASY_USA)), 
        stat="identity", 
        stat_params=list(), 
        geom="text",
        geom_params=list(colour="black", size = 2), 
        position=position_identity()
  ) +
  layer(data=data_final, 
        mapping=aes(x="ref per asy GER", y=as.character(ORIGIN_COUNTRY), label=as.character(round(ref_per_asy_GER))), 
        stat="identity", 
        stat_params=list(), 
        geom="text",
        geom_params=list(colour="black", size = 2), 
        position=position_identity()
  ) +
  layer(data=data_final, 
        mapping=aes(x="ref per asy USA", y=as.character(ORIGIN_COUNTRY), label=as.character(round(ref_per_asy_USA))), 
        stat="identity", 
        stat_params=list(), 
        geom="text",
        geom_params=list(colour="black", size = 2), 
        position=position_identity()
  ) +
  layer(data=data_final, 
        mapping=aes(x="ASY_GER", y=as.character(ORIGIN_COUNTRY), fill=KPI), 
        stat="identity", 
        stat_params=list(), 
        geom="tile",
        geom_params=list(alpha=0.50), 
        position=position_identity()
  ) +
  layer(data=data_final, 
        mapping=aes(x="ASY_USA", y=as.character(ORIGIN_COUNTRY), fill=KPI), 
        stat="identity", 
        stat_params=list(), 
        geom="tile",
        geom_params=list(alpha=0.50), 
        position=position_identity()
  ) +
  layer(data=data_final, 
        mapping=aes(x="REF_GER", y=as.character(ORIGIN_COUNTRY), fill=KPI), 
        stat="identity", 
        stat_params=list(), 
        geom="tile",
        geom_params=list(alpha=0.50), 
        position=position_identity()
  ) +
  layer(data=data_final, 
        mapping=aes(x="REF_USA", y=as.character(ORIGIN_COUNTRY), fill=KPI), 
        stat="identity", 
        stat_params=list(), 
        geom="tile",
        geom_params=list(alpha=0.50), 
        position=position_identity()
  ) +
  layer(data=data_final, 
        mapping=aes(x="ref per asy GER", y=as.character(ORIGIN_COUNTRY), fill=KPI), 
        stat="identity", 
        stat_params=list(), 
        geom="tile",
        geom_params=list(alpha=0.50), 
        position=position_identity()
  ) +
  layer(data=data_final, 
        mapping=aes(x="ref per asy USA", y=as.character(ORIGIN_COUNTRY), fill=KPI), 
        stat="identity", 
        stat_params=list(), 
        geom="tile",
        geom_params=list(alpha=0.50), 
        position=position_identity()
  )

-The Visualization in R:

-Of interesting note, the values shown here are very different than those seen in Tableau. Upon investigation, we see that Tableau has some something very strange with the data. For instance, Germany revieced a total of 2 asylum seekers from Australia since these data points were recorded. This is corroberated both in the R-visualization, and in the Oracle Database. However, in tableau, this value increases to 20 Asylum seekers, for no apparent reason. I suspect there was a problem with the join I performed on the data in Tableau.

-That being said, the trends we see in Tableau are largely still valid in the R data. Unsuprisingly, places where the US has tensions with, such as the Russian Federation, Iraq, and Afghanista, have a lower ratio of refugees per asylum seekers in Germany than in Russia. And we observe, once again, that South American Countries have a very small ratio of refugees per asylum seekers.

BLENDING: A Macro View

Here we looked at refugee influxes, but from the perspective of entire continents instead of individual host countries. To acquire this information, we combined our original data with a reference table of countries’ respective continents, like so:

SELECT * FROM 
(SELECT tot_pop, continents.NAME continent FROM CONTINENTS 
INNER JOIN 
(SELECT SUM(X2014) tot_pop, continent_code FROM YEARLY_POP_BY_COUNTRY_NUMERIC 
INNER JOIN COUNTRIES 
ON COUNTRIES.iso3 = YEARLY_POP_BY_COUNTRY_NUMERIC.country_code 
GROUP BY continent_code) cont_pop 
ON continents.code = cont_pop.continent_code) sum_pop 
FULL JOIN 
(select record_year, SUM(total_population) all_refugees, continents.NAME continent 
FROM 
((SELECT * FROM REFUGEE_STATS 
LEFT JOIN 
(SELECT cname, continent_code from COUNTRIES) co 
ON co.cname = REFUGEE_STATS.asylum_country 
WHERE REFUGEE_STATS.asylum_country != 'Various/Unknown') refs 
INNER JOIN 
(SELECT * FROM YEARLY_POP_BY_COUNTRY_NUMERIC) pops 
ON pops.country_name = refs.asylum_country) 
LEFT JOIN continents 
ON continents.CODE = refs.continent_code 
GROUP BY record_year, continents.NAME) rest 
ON rest.continent = sum_pop.continent 
WHERE tot_pop > 0;

The results:

  • Oceania here takes a surprising lead in refugee-haboring. Despite the largest single-country intakes generally occuring in Europe and (to a lesser extent) North Africa in the 90’s, this wider view shows that the vast populations and the large number of non- or slightly-hosting countries on those continents did much to lower the average. Not so in Oceania, the least-populous (barring Antarctica, which was omitted) continent with the fewest individual countries

  • That only deepens the question of South America’s post-2000 surge, however, which is only apparent at the continent level. What happened there? All we can tell from this data is that South America had a rush of refugees in the 2000’s that was not wholly inconsistent with peak refugee crises on other continents, but must have been more evenly spread across individual countries, as South American countries have shown little distinction in the previous nation-level analyses.